Introduction

Welcome to the documentation for hallucination evaluation in machine learning models. Hallucination, in the context of AI, refers to the generation of content that is not grounded in the input data. This documentation covers two types of hallucination tests: one for summarization models and another for Retrieval-Augmented Generation (RAG) models.

Hallucination Evaluation for Summarization Models

NLI Consistency

The NLI consistency test measures the logical consistency between an input text (or document) and a model-generated summary. The evaluation runs a summarization model over a set of input documents and scores each (document, summary) pair as entailing, contradicting, or neutral using a Natural Language Inference (NLI) model:

  • Entailment: Indicates that the summary logically implies the content of the document.
  • Contradiction: Indicates that the summary contradicts the content of the document.
  • Neutral: Indicates that no logical relationship can be drawn between the summary and the document.

The NLI consistency score is reported as an entailment probability.

  • Entailment Probability: A score between 0 and 1, representing the average degree to which the summaries logically imply the contents of the input text. A score of 0 indicates a low degree of entailment and a high degree of hallucination, while a score of 1 indicates a high degree of entailment and a low degree of hallucination.
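The aggregation described above can be sketched as follows. This is a minimal illustration, not DynamoFL's actual implementation: it assumes the NLI model has already been run and returned, for each (document, summary) pair, a probability distribution over entailment, neutral, and contradiction; the consistency score is then the mean entailment probability.

```python
# Sketch of NLI consistency scoring (helper and field names are
# illustrative assumptions, not the actual DynamoFL API).

def nli_consistency(pair_probs):
    """pair_probs: one dict per (document, summary) pair, with keys
    'entailment', 'neutral', and 'contradiction' summing to ~1.0.
    Returns the mean entailment probability across all pairs."""
    if not pair_probs:
        raise ValueError("need at least one (document, summary) pair")
    return sum(p["entailment"] for p in pair_probs) / len(pair_probs)

# Example: one strongly entailed summary, one mostly neutral summary.
scores = [
    {"entailment": 0.92, "neutral": 0.06, "contradiction": 0.02},
    {"entailment": 0.40, "neutral": 0.55, "contradiction": 0.05},
]
mean_entailment = nli_consistency(scores)  # (0.92 + 0.40) / 2
```

A score near 1 means most summaries are entailed by their source documents; a score near 0 flags widespread hallucination.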

UniEval Factuality

The UniEval factuality evaluation test measures the degree of factual support between an input text (or document) and a model-generated summary. The UniEval factuality score is calculated by providing a summarization model with a set of input documents and scoring each (document, summary) pair using another Large Language Model (LLM) as an evaluator. This LLM is trained on boolean question-answer prompts to evaluate factual support and has been found to significantly outperform various state-of-the-art evaluators.

  • UniEval Factuality Score: A score between 0 and 1, with 0 implying a low degree of factuality and a high degree of hallucination, and 1 indicating a high degree of factuality and a low degree of hallucination.
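A common way to turn a boolean question-answer evaluator into a score in [0, 1] is to normalize the probabilities the model assigns to "Yes" and "No" answers. The sketch below shows this normalization under that assumption; the actual UniEval prompts and scoring details may differ.

```python
import math

# Sketch of deriving a factuality score from a boolean QA evaluator
# (an assumption about the scoring scheme, not UniEval's exact code).
# The evaluator is asked whether the summary is factually supported by
# the document, and the score is P(yes) normalized against P(yes) + P(no),
# computed from the model's logits for the "Yes" and "No" answer tokens.

def factuality_score(yes_logit: float, no_logit: float) -> float:
    p_yes = math.exp(yes_logit)
    p_no = math.exp(no_logit)
    return p_yes / (p_yes + p_no)
```

Equal logits yield a score of 0.5; the more strongly the evaluator favors "Yes", the closer the score gets to 1.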

RAG Hallucination

Retrieval Relevance

Retrieval relevance measures how relevant the documents retrieved from the vector database (via the embedding model) are to each query. To measure retrieval relevance, DynamoFL generates a Retrieval Relevance Label from an LLM.

  • Retrieval Relevance Label: A binary label with 0 implying not relevant and 1 indicating relevant.
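One way to obtain such a binary label is to prompt an LLM judge with the query and the retrieved document and parse its YES/NO answer. The prompt wording and parsing below are illustrative assumptions, not DynamoFL's actual implementation.

```python
# Hypothetical LLM-judge prompt and parser for a binary
# Retrieval Relevance Label (all names here are assumptions).

JUDGE_PROMPT = (
    "Query: {query}\n"
    "Retrieved document: {document}\n"
    "Is the document relevant to the query? Answer YES or NO."
)

def parse_relevance_label(llm_output: str) -> int:
    """Map the judge's free-text answer to a binary label: 1 = relevant.
    Any answer that does not clearly start with YES defaults to 0."""
    tokens = llm_output.strip().split()
    if not tokens:
        return 0
    first = tokens[0].upper().rstrip(".,!")
    return 1 if first == "YES" else 0
```

Defaulting to 0 on ambiguous output is a conservative choice: an unclear judgment is treated as "not relevant" rather than silently inflating the relevance rate.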

Response Faithfulness

Response Faithfulness represents the faithfulness of model-generated responses to the retrieved documents. To measure response faithfulness, DynamoFL uses a Natural Language Inference (NLI) model to generate a Response Faithfulness Label.

  • Response Faithfulness Label: A binary label with 0 implying not faithful and 1 indicating faithful.
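An NLI model outputs probabilities rather than a binary decision, so a threshold is needed to produce the label. The sketch below thresholds the entailment probability; the 0.5 cutoff and field names are illustrative assumptions about how the NLI output might be binarized.

```python
# Sketch of deriving a binary Response Faithfulness Label from NLI
# output (threshold value and dict keys are assumptions).

def faithfulness_label(nli_probs, threshold=0.5):
    """nli_probs: dict with 'entailment', 'neutral', and 'contradiction'
    probabilities for the (retrieved documents, response) pair.
    Returns 1 if the response is judged faithful (entailed by the
    documents), else 0."""
    return 1 if nli_probs["entailment"] >= threshold else 0
```

Raising the threshold makes the label stricter: fewer responses pass, but those that do are more strongly entailed by the retrieved evidence.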

Response Relevance

Response Relevance represents the relevance of model-generated responses to the query. To measure response relevance, DynamoFL generates a Response Relevance Label using an LLM.

  • Response Relevance Label: A binary label with 0 indicating not relevant and 1 indicating relevant.
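Taken together, the three binary labels above can be aggregated across a test set into dataset-level rates, one per metric. The field names and reporting format below are illustrative assumptions; they show only how per-example labels roll up into the fractions a report might display.

```python
# Sketch aggregating per-example binary labels into dataset-level rates
# (dict keys are illustrative, not DynamoFL's actual schema).

def rag_hallucination_report(examples):
    """examples: list of dicts, each holding binary (0/1) labels under
    'retrieval_relevance', 'response_faithfulness', and
    'response_relevance'. Returns the fraction of examples labeled 1
    for each metric."""
    if not examples:
        raise ValueError("need at least one evaluated example")
    metrics = ("retrieval_relevance", "response_faithfulness", "response_relevance")
    n = len(examples)
    return {key: sum(e[key] for e in examples) / n for key in metrics}
```

A low response-faithfulness rate alongside a high retrieval-relevance rate, for instance, points at the generator hallucinating despite good retrieval.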